
PAC Learning

Definition (PAC-learnability)

A hypothesis class $H$ is PAC-learnable if there exists a learning algorithm with the following property: for every $\epsilon, \delta \in (0, 1)$, every distribution $D$ over $X$, and every $h \in H$, when the algorithm is given $m(\epsilon, \delta)$ i.i.d. samples drawn from $D$ and labeled by $h$, it produces a hypothesis $\hat{h}$ such that, with probability at least $1 - \delta$, $L(\hat{h}) \leq \epsilon$. (The probability is over the randomness of the training set as well as any internal randomness of the algorithm.)

Theorem 1 (PAC bound for finite hypothesis class)

Let $H$ be a hypothesis class of finite size $|H|$. Then $H$ is PAC-learnable with

\begin{gather*}
m(\epsilon, \delta) = O\!\left(\frac{\log(|H|/\delta)}{\epsilon}\right)
\end{gather*}

The proof is given in the previous note.
Solving for $m$: if $m \geq \frac{\log(|H|/\delta)}{\epsilon}$, then the probability of failure in PAC learning is at most $\delta$, and hence the algorithm succeeds with probability at least $1 - \delta$.
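
As a quick numerical illustration (a minimal sketch; the helper name and the concrete values of $\epsilon$, $\delta$, and $|H|$ are illustrative, not from the notes), the bound can be evaluated directly:

```python
import math

def pac_sample_complexity(eps: float, delta: float, h_size: int) -> int:
    """Realizable-case bound from Theorem 1 (up to constants): m >= log(|H|/delta) / eps."""
    return math.ceil(math.log(h_size / delta) / eps)

# Illustrative values: accuracy 0.05, confidence 0.95, |H| = 1000.
print(pac_sample_complexity(eps=0.05, delta=0.05, h_size=1000))  # -> 199
```

Since Theorem 1 hides constants inside the $O(\cdot)$, the returned number is only indicative of the scale.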

Agnostic PAC-Learning

  • PAC learning requires the realizability assumption, so it cannot handle noise or error in the labels.
  • Agnostic PAC learning relaxes the realizability assumption and allows the labels to be noisy.
  • If a class is agnostic PAC learnable, it is also PAC learnable: under realizability, $\min_{h' \in H} L_D(h') = 0$, so the agnostic guarantee reduces to the PAC guarantee. In this sense agnostic PAC learning is more general than PAC learning.

Definition (Agnostic PAC Learnability)

A hypothesis class $H$ is agnostic PAC learnable if there exists a function $m_H : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon, \delta \in (0, 1)$ and for every distribution $D$ over $X \times Y$, when running the learning algorithm on $m \geq m_H(\epsilon, \delta)$ i.i.d. examples generated by $D$, the algorithm returns a hypothesis $h$ such that, with probability at least $1 - \delta$ (over the choice of the $m$ training examples),

\begin{gather*}
L_D(h) \leq \min_{h' \in H} L_D(h') + \epsilon.
\end{gather*}

Definition ($\epsilon$-representative sample)

A training set $S$ is called $\epsilon$-representative (w.r.t. domain $Z$, hypothesis class $H$, loss function $l$, and distribution $D$) if

\begin{gather*}
\forall h \in H, \quad |L_S(h) - L_D(h)| \leq \epsilon.
\end{gather*}
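
A minimal sketch of what this definition means computationally, under an invented toy setup (the domain, label-noise rate, and threshold hypotheses below are assumptions for illustration, not from the notes): draw $S$ from a known discrete $D$, and compare $L_S(h)$ with the exactly computable $L_D(h)$ over a small finite $H$.

```python
import random

# X is uniform on {0,...,4}; the "true" label is +1 iff x >= 3, flipped with prob 0.1.
xs = [0, 1, 2, 3, 4]
p_flip = 0.1

def sample_z():
    x = random.choice(xs)
    y = 1 if x >= 3 else -1
    if random.random() < p_flip:
        y = -y
    return x, y

# Small finite hypothesis class: threshold classifiers h_t(x) = +1 iff x >= t.
H = [lambda x, t=t: 1 if x >= t else -1 for t in range(6)]

def true_loss(h):
    """L_D(h): exact expected 0-1 loss under the known distribution."""
    total = 0.0
    for x in xs:
        y_clean = 1 if x >= 3 else -1
        p_err = (1 - p_flip) * (h(x) != y_clean) + p_flip * (h(x) != -y_clean)
        total += p_err / len(xs)
    return total

def empirical_loss(h, S):
    """L_S(h): average 0-1 loss on the sample S."""
    return sum(h(x) != y for x, y in S) / len(S)

m, eps = 2000, 0.05
S = [sample_z() for _ in range(m)]
max_gap = max(abs(empirical_loss(h, S) - true_loss(h)) for h in H)
print(f"max gap = {max_gap:.3f}; eps-representative for eps={eps}? {max_gap <= eps}")
```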

Lemma

Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative (w.r.t. domain $Z$, hypothesis class $H$, loss function $l$, and distribution $D$). Then any output of $ERM_H(S)$, namely any $h_S \in \operatorname{argmin}_{h \in H} L_S(h)$, satisfies

\begin{gather*}
L_D(h_S) \leq \min_{h \in H} L_D(h) + \epsilon.
\end{gather*}

Proof

Let $\bar{h} = \operatorname{argmin}_{h \in H} L_D(h)$. Then

\begin{align*}
L_D(h_S) &\leq L_S(h_S) + \epsilon/2 \\
&\leq L_S(\bar{h}) + \epsilon/2 \\
&\leq L_D(\bar{h}) + \epsilon/2 + \epsilon/2 = L_D(\bar{h}) + \epsilon,
\end{align*}

where the first and third inequalities use that $S$ is $\frac{\epsilon}{2}$-representative and the second uses that $h_S$ minimizes $L_S$ over $H$.

Definition (Uniform Convergence)

We say that a hypothesis class $H$ has the uniform convergence property (w.r.t. a domain $Z$ and a loss function $l$) if there exists a function $m_H^{UC} : (0, 1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0, 1)$ and for every probability distribution $D$ over $Z$, if $S$ is a sample of $m \geq m_H^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $D$, then, with probability at least $1 - \delta$, $S$ is $\epsilon$-representative.

Corollary

If a class $H$ has the uniform convergence property with a function $m_H^{UC}$, then the class is agnostically PAC learnable with sample complexity $m_H(\epsilon, \delta) \leq m_H^{UC}(\epsilon/2, \delta)$. Furthermore, in that case, the $ERM_H$ paradigm is a successful agnostic PAC learner for $H$.

Finite Classes are Agnostic PAC Learnable

Let $H$ be a hypothesis class with $|H| < \infty$. Then $H$ is agnostic PAC learnable with

\begin{gather*}
m_H(\epsilon, \delta) \leq \left\lceil \frac{2\log(2|H|/\delta)}{\epsilon^2} \right\rceil.
\end{gather*}
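
For a rough sense of scale (again a sketch with illustrative values; the function name is made up), note the $1/\epsilon^2$ dependence here versus $1/\epsilon$ in the realizable case:

```python
import math

def agnostic_sample_complexity(eps: float, delta: float, h_size: int) -> int:
    """Finite-class agnostic bound: m >= 2 * log(2|H| / delta) / eps**2."""
    return math.ceil(2 * math.log(2 * h_size / delta) / eps ** 2)

# Same illustrative values as in the realizable sketch: eps = 0.05, delta = 0.05, |H| = 1000.
print(agnostic_sample_complexity(0.05, 0.05, 1000))  # -> 8478, versus 199 in the realizable case
```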

Proof

Writing $\hat{L}(h) = L_S(h)$ for the empirical loss and $L(h) = L_D(h)$ for the population loss, we will show:

  1. For any fixed $h \in H$ and $\epsilon > 0$,
     \begin{gather*} \Pr\bigl[|\hat{L}(h) - L(h)| \leq \epsilon\bigr] \geq 1 - 2e^{-2m\epsilon^2}. \end{gather*}
  2. For any $\epsilon > 0$,
     \begin{gather*} \Pr\bigl[\forall h \in H,\ |\hat{L}(h) - L(h)| \leq \epsilon\bigr] \geq 1 - 2|H|e^{-2m\epsilon^2}. \end{gather*}
  3. For $m \geq \left\lceil \frac{\log(2|H|/\delta)}{2\epsilon^2} \right\rceil$, with probability $1 - \delta$,
     \begin{gather*} |\hat{L}(h) - L(h)| \leq \epsilon \quad \forall h \in H. \end{gather*}
  4. By uniform convergence, $H$ is agnostic PAC learnable with
     \begin{gather*} m \geq \left\lceil \frac{2\log(2|H|/\delta)}{\epsilon^2} \right\rceil \end{gather*}
     samples.

Proof of (1)

Lemma (Hoeffding's inequality). Let $X_1, X_2, \ldots, X_m$ be independent random variables such that $a_i \leq X_i \leq b_i$ for each $i \in [m]$. Then for any $\epsilon > 0$,

\begin{gather*}
\Pr\left[ \left| \frac{1}{m}\sum_{i=1}^{m} X_i - \mathbb{E}\left[\frac{1}{m}\sum_{i=1}^{m} X_i\right] \right| \leq \epsilon \right] \geq 1 - 2\exp\left(\frac{-2m^2\epsilon^2}{\sum_{i=1}^{m}(b_i - a_i)^2}\right).
\end{gather*}

Given Hoeffding's inequality, we prove (1): take $X_i = l(h(x_i), y_i)$. Since $l(h(x_i), y_i) = \mathbb{1}(h(x_i) \neq y_i)$, each $X_i \in \{0, 1\}$, so $a_i = 0$ and $b_i = 1$. Moreover, $\frac{1}{m}\sum_{i=1}^{m} X_i = \hat{L}(h)$ and $\mathbb{E}\bigl[\frac{1}{m}\sum_{i=1}^{m} X_i\bigr] = L(h)$. Therefore, we have

\begin{gather*}
\Pr\bigl[|\hat{L}(h) - L(h)| \leq \epsilon\bigr] \geq 1 - 2\exp(-2m\epsilon^2).
\end{gather*}
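
A small Monte Carlo sanity check of (1) (a sketch; the true loss rate, sample size, and trial count are arbitrary choices): simulate i.i.d. 0-1 losses of a fixed hypothesis and compare the observed deviation frequency with $2e^{-2m\epsilon^2}$.

```python
import math
import random

def deviation_frequency(true_loss: float, m: int, eps: float, trials: int = 5000) -> float:
    """Fraction of trials in which |empirical 0-1 loss - true loss| > eps."""
    bad = 0
    for _ in range(trials):
        emp = sum(random.random() < true_loss for _ in range(m)) / m
        if abs(emp - true_loss) > eps:
            bad += 1
    return bad / trials

m, eps, p = 200, 0.1, 0.3  # a fixed hypothesis with true 0-1 loss 0.3 (illustrative)
observed = deviation_frequency(p, m, eps)
bound = 2 * math.exp(-2 * m * eps ** 2)
print(f"observed deviation frequency: {observed:.4f}, Hoeffding bound: {bound:.4f}")
```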

Proof of (2)

\begin{align*}
\Pr\bigl[\forall h \in H,\ |\hat{L}(h) - L(h)| \leq \epsilon\bigr] &= 1 - \Pr\left[\bigcup_{i=1}^{|H|} \bigl\{|\hat{L}(h_i) - L(h_i)| > \epsilon\bigr\}\right] \\
&\geq 1 - \sum_{i=1}^{|H|} \Pr\bigl[|\hat{L}(h_i) - L(h_i)| > \epsilon\bigr] \\
&\geq 1 - |H|\bigl(2\exp(-2m\epsilon^2)\bigr),
\end{align*}

where the first inequality is the union bound and the second applies (1) to each $h_i \in H$.

Proof of (3)

Set $m \geq \left\lceil \frac{\log(2|H|/\delta)}{2\epsilon^2} \right\rceil$ in the previous step. Then the failure probability $2|H|\exp(-2m\epsilon^2)$ is at most $\delta$, i.e. $\Pr\bigl[\forall h \in H,\ |\hat{L}(h) - L(h)| \leq \epsilon\bigr] \geq 1 - \delta$.
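
Explicitly, the algebra behind this choice of $m$:

\begin{gather*}
2|H|e^{-2m\epsilon^2} \leq \delta \iff e^{-2m\epsilon^2} \leq \frac{\delta}{2|H|} \iff m \geq \frac{\log(2|H|/\delta)}{2\epsilon^2}.
\end{gather*}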

Proof of (4)

In (3), we have shown that $H$ has the uniform convergence property with $m_H^{UC}(\epsilon, \delta) = \left\lceil \frac{\log(2|H|/\delta)}{2\epsilon^2} \right\rceil$. Thus, by the corollary above, $H$ is agnostic PAC learnable with $m_H(\epsilon, \delta) \leq m_H^{UC}(\epsilon/2, \delta)$ samples, and the $ERM_H$ rule achieves this.
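
Working out the substitution gives exactly the claimed bound:

\begin{gather*}
m_H(\epsilon, \delta) \leq m_H^{UC}(\epsilon/2, \delta) = \left\lceil \frac{\log(2|H|/\delta)}{2(\epsilon/2)^2} \right\rceil = \left\lceil \frac{2\log(2|H|/\delta)}{\epsilon^2} \right\rceil.
\end{gather*}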

Theorem (Bayes Classifier)

Let $S$ be a finite set, and let $X$ have a (discrete) distribution $\Pi$ over $S$. Then the best possible binary classifier is given by

\begin{gather*}
g^*(x) = \operatorname{sign}\bigl(\mathbb{E}[Y \mid X = x]\bigr).
\end{gather*}

Here $y(x) = \mathbb{E}[Y \mid X = x]$ with $Y \in \{+1, -1\}$; $g^*(x)$ is known as the Bayes classifier.
The proof is straightforward and is omitted here.

Bayes Risk

\begin{gather*}
L^* = L(g^*) = \sum_{x \in S} \frac{1 - |y(x)|}{2}\, \Pi(x)
\end{gather*}
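
A minimal sketch of the Bayes classifier and Bayes risk on a toy discrete example (the marginal $\Pi$ and the conditional probabilities $\eta(x) = \Pr(Y = +1 \mid X = x)$ below are invented for illustration):

```python
# Toy discrete example: S = {0, 1, 2}, Pi is the marginal of X,
# eta[x] = P(Y = +1 | X = x), so y(x) = E[Y | X = x] = 2*eta[x] - 1 with Y in {+1, -1}.
Pi = {0: 0.5, 1: 0.3, 2: 0.2}
eta = {0: 0.9, 1: 0.4, 2: 0.5}

def y_bar(x):
    return 2 * eta[x] - 1  # E[Y | X = x]

def bayes_classifier(x):
    # g*(x) = sign(E[Y | X = x]); ties (y(x) = 0) broken as +1 here
    return 1 if y_bar(x) >= 0 else -1

# Bayes risk: L* = sum_x (1 - |y(x)|) / 2 * Pi(x)
bayes_risk = sum((1 - abs(y_bar(x))) / 2 * Pi[x] for x in Pi)
print({x: bayes_classifier(x) for x in Pi}, f"L* = {bayes_risk:.3f}")
```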